Looking through glass: Knowledge discovery from materials science literature using natural language processing

Authors

Abstract

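•Natural language processing is used for information extraction from research papers
•Caption cluster plots are used for exploring figure captions across the entire corpus
•Elemental maps to identify chemical elements reported in a study
•A framework to extract domain-specific information through queries on the literature

Most of the knowledge generated through scientific enquiry in the materials domain is presented in the form of unstructured data. Among the available sources, such as online websites, digital data, and publications, peer-reviewed journals serve as an undisputed source of reliable information regarding materials synthesis, characterization, and properties. Despite the availability of large data, only a limited fraction of it is compiled in the form of machine-readable databases, most of which are manually curated. Here, applying natural language processing on a large corpus of journal publications on inorganic glasses, we present a framework to extract information from text and images, which answers queries related to synthesis and characterization techniques, and even the chemical elements used. The scalable approach presented here can be applied to other domains for efficient information retrieval from the literature.

Most of the knowledge in the materials science literature is in the form of unstructured data such as text and images. Here, we present a framework employing natural language processing, which automates text and image comprehension and precision knowledge extraction from the inorganic glasses’ literature. The abstracts are automatically categorized using latent Dirichlet allocation (LDA) to classify and search semantically linked publications. Similarly, a comprehensive summary of images is presented using the caption cluster plot (CCP), providing direct access to images buried in the papers. Finally, we combine the LDA and CCP with chemical elements to present an elemental map, a topical and image-wise distribution of the elements occurring in the literature. Overall, the framework presented here can be a generic and powerful tool to extract and disseminate material-specific information on composition–structure–processing–property dataspaces, allowing insights into fundamental problems relevant to the materials science community and accelerated materials discovery.

An overwhelmingly large amount of knowledge is mostly stored in the form of unstructured texts and images. These range from expository archives, such as books, journals, and dissertations, to condensed representations, such as handbooks and manuals. Materials science, being a highly interdisciplinary area, commands a large repository of such data. However, only a limited fraction of this data is collected and curated in a structured form, for example, as a database of composition–structure–property relationships. Materials information is increasingly siloed and simply too large for utilization by any one individual or group.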
Like all branches of science, materials science is afflicted by the curse of incommensurate information [1,2]. Thus, the accessibility of the vast majority of this information is limited, as it (1) is time-consuming to read and analyze and (2) requires an expert to understand, interpret, and summarize. Recent advancements in natural language processing (NLP) provide a promising solution to this problem through the automation of comprehension, querying, and information extraction from texts. NLP has been used extensively, specifically in the biological sciences, for more than 2 decades [3-5]. A biomedical-specific language model, namely BioBERT [4], for biomedical text mining, stands as testimony to the advances and contributions of NLP in the biological sciences. In contrast, applications of NLP in materials science remain sparse [6-8].

Similar to the biological sciences, the study of materials poses some unique challenges to the application of NLP to mine information, due to jargon and the lack of uniform conventions in scientific writing [6,9]. Despite these challenges, recent studies have shown that NLP can indeed address open problems such as novel materials discovery [7], unraveling synthesis pathways [10], and extracting composition–property databases [11]. Several studies [12-14] demonstrated the automated generation of databases for magnetic [15] and battery materials [11] using ChemDataExtractor [16], and also for predicting phase diagrams [17]. Other studies [6,10,18] used NLP together with artificial neural networks to predict synthesis parameters for oxides [10,19,20] and the properties of zeolites [21] and cementitious materials [22]. Tshitoyan et al. [7] used word vectors, converting semantic relationships into vector algebra, and extended the method to thermoelectrics. Kim et al. [10] addressed synthesis recipes for oxides with an NLP-based approach. Recently, Matscholar [9] was introduced as a materials discovery engine able to identify materials, properties, characterization methods, and descriptors in a given text using a custom-built named entity recognition (NER) system. These developments suggest that NLP approaches offer a route to condense and represent information from the literature, leading to accelerated materials development.

Very few studies have, however, focused on images in the literature [23,24]. As the adage goes, “a picture is worth a thousand words”: images in the scientific literature hold information crucial to hypotheses and theories [25]. To date, there is no framework that allows the compilation of images from the literature. Further, the images in a manuscript should be read in conjunction with the text to understand their context. While many studies have focused on textual information, no effort has been made thus far to connect text and images and allow the dissemination of information in a holistic manner.

Here, we demonstrate a framework that extracts specific, nuanced information through the exploration of text and images. Specifically, we use approximately 100,000 articles in the area of inorganic glasses, an archetypical disordered material. Glasses are common and widely used among engineering materials, with uses spanning architectural and functional applications, among others [26,27]. Machine learning (ML) has been used to develop predictive models for the optical, electronic, and mechanical properties of glasses [26,28-38]. Several works have shared their data along with the trained ML models [28,30,36,39]. For instance, the software package Python for Glass Genomics (PyGGi, https://pyggi.iitd.ac.in) provides a database, predictive models for nine key properties, and optimization for targeted glass discovery [40]. These models, however, relied on existing databases for their training and analysis [41], and are hence restricted to parameter predictions using regression models. It is well known that the properties of glasses, being nonequilibrium materials, are not just a function of composition, but are fundamentally influenced by the processing history and testing conditions [27,42-44].

Using a combination of NLP algorithms, protocols, and visualization tools, we show that very specific questions can be answered. These include material/property-specific queries and broader issues, such as the following:
1. What microstructural characterizations are used for glasses?
2. Are there papers published that are theoretical as opposed to experimental?
3. Where is americium used in glasses?
4. What glasses are used in LEDs?
5. Are there photoluminescence studies on glasses that contain fluorine?
6. Can we find the optical properties of glasses manufactured by solid state synthesis?

The framework is developed using abstracts, text, and images. To demonstrate the proposed approach, we downloaded 600,000 articles, including full texts, using the keywords “oxide glasses” and “materials science” through the CrossRef metadata query API [45] (https://www.crossref.org/education/retrieve-metadata/) and the Elsevier Science Direct API [46] (https://dev.elsevier.com/). Following this, supervised learning was performed on the manuscripts to filter them (see Methods for details). Abstracts are the most information-dense organ of a paper, containing the materials under study, the properties explored, and the characterization/synthesis methods used in service of the investigation. As such, they are unlikely to contain spurious information or to refer to items mentioned only incidentally in the text. This specificity makes the abstract the most useful part of a paper, and it is therefore not surprising that earlier studies have taken the abstract of a paper as input [7]. Based on supervised learning, abstracts were classified under the topic “glass,” with a precision, accuracy, and recall of 92%, 86%, and 67%, respectively, on the test set. Note that the model with the highest precision was selected so that the corpus is as glass-related as possible. Although this is not an exhaustive list, the total number of articles is of the same order as that identified in surveys of the literature on glass [47,48].

An unsupervised algorithm called latent Dirichlet allocation [49] was used to categorize the abstracts into 15 “topics,” where each topic is defined by a set of words and their probability of occurrence within the topic. This allows rapid organization of the literature with minimal human supervision, a capability that is not otherwise provided today. The categories are visualized in Figure 1A. Each abstract is vectorized using Term Frequency–Inverse Document Frequency [50] (TFIDF), which represents the document in a higher dimensional space. T-distributed stochastic neighbor embedding (t-SNE) projects these vectors onto a 2-dimensional (2D) plane such that documents with high cosine similarity group together. The color of each pixel is determined by the topic assigned by LDA. It can be seen immediately from Figure 1A that points with similar colors are grouped together. This suggests that TFIDF vectorization followed by t-SNE is consistent with the clustering of topics by LDA, and provides a graphical representation of the field that succinctly summarizes the details described earlier. A descriptive label for each of the topics was established from its lexical set. For example, Topic 11 contains the words “er,” “yb,” “emission,” “doped,” “luminescence,” “nd,” and “tm.” Analysis of these high-frequency words by domain experts indicates luminescence of glasses doped with rare earth ions, and hence the topic is labeled “Rare Earth glasses.” This is shown schematically as a histogram in Figure 1B. descripti
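The TFIDF vectorization and cosine-similarity grouping described above can be illustrated with a minimal sketch. This is not the authors' implementation: the three toy "abstracts," the whitespace tokenization, the plain tf × log(N/df) weighting, and the function names `tfidf_vectors` and `cosine_similarity` are all assumptions made for illustration, and the t-SNE projection and LDA topic assignment are omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy abstracts (assumption): two on rare-earth luminescence,
# one on an unrelated glass topic.
abstracts = [
    "er yb doped glass luminescence emission",
    "nd tm doped glass emission luminescence",
    "silicate glass dissolution kinetics hardness",
]
vecs = tfidf_vectors(abstracts)
sim_rare_earth = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
print(sim_rare_earth > sim_unrelated)
```

In this toy corpus the word "glass" occurs in every document, so its inverse document frequency is zero and it contributes nothing to the similarity. The two rare-earth abstracts remain similar through their shared dopant and luminescence vocabulary, which is the kind of grouping a projection such as t-SNE would then make visible.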

Similar resources

Deep Knowledge Discovery from Natural Language Texts

We introduce a knowledge-based approach to deep knowledge discovery from real-world natural language texts. Data mining, data interpretation, and data cleaning are all incorporated in cycles of quality-based terminological reasoning processes. The methodology we propose identifies new knowledge items and assimilates them into a continuously updated domain knowledge base.

THROUGH INFORMATION ASSOCIATION : A Knowledge Discovery Tool for Materials Science

The recent announcements of the discovery of materials exhibiting superconductivity (MgB2 – and a conjugated polymer, regioregular poly(3-hexylthiophene)) serve to highlight the fact that materials discovery is still a process that is often governed by empiricism and accidental discoveries. While incremental progress is made in specific technological areas of interest, we need to have a means o...

NATURAL LANGUAGE PROCESSING FOR INFORMATION RETRIEVAL AND KNOWLEDGE DISCOVERY

Natural Language Processing (NLP) is a powerful technology for the vital tasks of information retrieval (IR) and knowledge discovery (KD) which, in turn, feed the visualization systems of the present and future and enable knowledge workers to focus more of their time on the vital tasks of analysis and prediction. First, a definition of NLP. Natural language processing is a set of computational ...

Literature-based knowledge discovery in climate science

Climate change caused by anthropogenic activity is one of the biggest challenges of our time. Researchers are striving to understand the effects of global warming on the ecological systems of the oceans, and how these ecological systems influence the global climate, a line of research that is crucial in order to counteract or adapt to the effects of global warming. A major challenge that resear...

Through the Looking Glass

In 1986, when I first started to think about the invariant theory of finite groups, there existed only one superb reference, the article by R.P. Stanley[1]. Since that time, there has been an explosion of interest, with many books and articles published, and conferences held. For example, there are two books by Bernd Sturmfels[2],[3], David Benson’s book[4] Polynomial Invariants of Finite Group...

Journal

Journal title: Patterns

Year: 2021

ISSN: 2666-3899

DOI: https://doi.org/10.1016/j.patter.2021.100290